About the project

https://github.com/villetaro/IODS-project

Moving forward with analysis!


Chapter 2: Data analysis with R and regression model

learning2014 <- read.table(file="http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header=T, sep=",")

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

There are about twice as many females as males. Age ranges from 17 to 55 with a mean of 25.51, so the age scale covers a wide range. The points variable has a mean of 22.72, a minimum of 7.00 and a maximum of 33.00.

Now we can also make a graphical overview of the data with ggplot2.

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

The graphical overview could have been a little better. I decided to highlight gender, as I think it was the most important variable in the model.

From the graph we can see the correlations between the different variables. The highest correlation is between attitude and points. The lowest correlations are nearly tied: age vs. attitude and age vs. points. Overall the age variable has quite low correlations compared to the others.

malli <- lm(points ~ attitude + stra + surf, data=learning2014)
summary(malli)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The only clearly significant variable by p-value is attitude, which is why we remove the stra and surf variables from the model. Now we fit it again.

malli <- lm(points ~ attitude, data=learning2014)
summary(malli)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Multiple R-squared refers to the usual R^2, which can be interpreted as the fraction of the variance of the outcome variable that is explained by the explanatory variables. In this model R^2 is 0.1906, which is quite low, so the model does not explain the outcome variable very well.
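As a quick sanity check, R^2 can be computed by hand as one minus the residual sum of squares divided by the total sum of squares. A minimal sketch on synthetic toy data (not the learning2014 set):

```r
# R^2 = 1 - SS_res / SS_tot, checked against summary()$r.squared
# (toy data for illustration only)
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)
r2_manual <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```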

library(ggplot2)
plot(malli, which=c(1, 2,5), col = "turquoise2",  lwd = 3)

The “Residuals vs Fitted” plot shows whether the residuals have non-linear patterns. Comparing this plot with the model's assumptions, we seem to have a few outliers. Otherwise everything looks good.

The “Normal Q-Q” plot shows whether the residuals are normally distributed, i.e. how well the observed residuals follow the theoretical quantiles.

The “Residuals vs Leverage” plot helps us spot influential outliers.


Chapter 3: Logistic regression

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", header=T, sep=",")

dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data has 382 observations and 35 variables after the following changes:

- the variables not used for joining the two datasets were combined by averaging (including the grade variables)
- ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
- ‘high_use’ is TRUE if ‘alc_use’ is higher than 2, and FALSE otherwise

The data is about students’ alcohol consumption.

I’m going to conduct an analysis on relationships between high/low alcohol consumption and sex, age and the grades G1, G2 and G3.

My hypotheses were as follows:

H1: Lower grades lead to higher alcohol consumption.
H2: Male students drink more than female students.
H3: The age variable does not have much of an effect.

Next I draw the plots and see how my hypotheses begin to look.

library(ggplot2)
p1 <- ggplot(alc, aes(x = high_use, y = G1, col = sex)) + geom_boxplot() + ylab("first period grade")

p2 <- ggplot(alc, aes(x = high_use, y = G2, col = sex)) + geom_boxplot() + ylab("second period grade")

p3 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("final grade")

p4 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + ylab("age")

library(Rmisc)  # multiplot() is not part of ggplot2; Rmisc provides one
multiplot(p1, p2, p3, p4, cols = 2)

We begin from the top left corner. The grades run from 0 to 20, and high_use is binary (FALSE/TRUE). We see that first-period grades are a little better when high_use is FALSE. We also see long tails, which point to outliers.

The second picture, in the bottom left corner, tells much the same story, but now the y-axis is the second-period grade. Interestingly, the tails at the lower end of the grade scale are long, with dots all the way down at grade 0. Those are outliers again.

The third picture, in the top right corner, shows the final grade against alcohol consumption. We don’t really see any large differences, so let’s move on.

The fourth and final picture, in the bottom right corner, is very interesting. First of all, the ages in the high-use group are quite low, at least by Finnish standards. The median age is 16 in the female group and 17 in the male group. That makes it worth looking at the age variable over the whole data. It looks like this:

summary(alc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   16.00   17.00   16.59   17.00   22.00

Now let’s build a model based on the hypotheses above. I try to explain high alcohol use with the grades, sex and age, using a logistic regression model.

model2 <- glm(high_use ~ G1 + G2 + G3 + sex + age, family = "binomial", data = alc)

summary(model2)
## 
## Call:
## glm(formula = high_use ~ G1 + G2 + G3 + sex + age, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5220  -0.8417  -0.6686   1.1570   2.1413  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.895489   1.807142  -2.156   0.0311 *  
## G1          -0.178473   0.073996  -2.412   0.0159 *  
## G2           0.004475   0.088770   0.050   0.9598    
## G3           0.081996   0.061465   1.334   0.1822    
## sexM         0.962330   0.239154   4.024 5.72e-05 ***
## age          0.212090   0.103040   2.058   0.0396 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 433.69  on 376  degrees of freedom
## AIC: 445.69
## 
## Number of Fisher Scoring iterations: 4

Now let’s look at the results of the model. Sex is by far the most significant variable; G1 and age are also significant at the 5% level, while G2 and G3 are not. One could also turn the model around and explain grades with high alcohol use. That is a valid point, but it is not easy to set up with logistic regression here, so we leave it for another time.

Now we calculate the odds ratios and their confidence intervals.

OR <- coef(model2) %>% exp
CI <- confint(model2) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
##                     OR        2.5 %    97.5 %
## (Intercept) 0.02033343 0.0005560289 0.6762128
## G1          0.83654645 0.7201998351 0.9632938
## G2          1.00448490 0.8433762954 1.1972492
## G3          1.08545156 0.9673539597 1.2340260
## sexM        2.61778817 1.6469686500 4.2128194
## age         1.23625854 1.0123920001 1.5178536

Looking at the odds ratios, the first thing we see is that the odds of high use are about 2.6 times higher for males than for females. The grade variables G1, G2 and G3 all get values close to (or below) one.
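For intuition, exp(coefficient) in a logistic regression is an odds ratio. A small sketch on simulated toy data (hypothetical, not the alc data) shows that for a single binary predictor it matches the odds ratio computed directly from the 2x2 table:

```r
# exp(coef) from glm equals the odds ratio from the 2x2 table
set.seed(2)
g  <- rep(c(0, 1), each = 200)               # binary predictor
yv <- rbinom(400, 1, plogis(-1 + 1.2 * g))   # binary outcome
fit <- glm(yv ~ g, family = "binomial")
tab <- table(g, yv)
or_table <- (tab["1", "1"] / tab["1", "0"]) / (tab["0", "1"] / tab["0", "0"])
all.equal(unname(exp(coef(fit)["g"])), or_table, tolerance = 1e-6)  # TRUE
```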

I decided to leave out G2 and G3, which were not significant, and also age to keep the model simple. Then I fitted a new model below. G1 could also have been left out, but I wanted to keep one grade variable since I started with all three.

model3 <- glm(high_use ~ G1 + sex, family = "binomial", data = alc)

summary(model3)
## 
## Call:
## glm(formula = high_use ~ G1 + sex, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3037  -0.8390  -0.6862   1.2508   2.0911  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.39990    0.39690  -1.008  0.31368    
## G1          -0.09263    0.03590  -2.580  0.00987 ** 
## sexM         0.96992    0.23735   4.086 4.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 440.48  on 379  degrees of freedom
## AIC: 446.48
## 
## Number of Fisher Scoring iterations: 4

The results look quite different now compared to the first model: G1 earned a second significance star and looks clearly more significant.

The intercept also rose from -3.895 to -0.400, which is a big change.

OR <- coef(model3) %>% exp
CI <- confint(model3) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 0.6703900 0.3065190 1.457539
## G1          0.9115336 0.8484642 0.976978
## sexM        2.6377259 1.6654805 4.230000

A second look at the odds ratios shows much the same picture as before: sex still has the strongest predictive power.

I drew a crosstab of the predictions. It shows that the model predicted low use correctly about 69% of the time (out of the 70.7% of students who actually are low users), but high use correctly only about 2.9% of the time (out of the 29.3% who actually are high users).

probabilities <- predict(model3, type = "response")

alc <- mutate(alc, probability = probabilities)


alc <- mutate(alc, prediction = probabilities > 0.5)

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.69109948 0.01570681 0.70680628
##    TRUE  0.26439791 0.02879581 0.29319372
##    Sum   0.95549738 0.04450262 1.00000000

Then I calculated the penalty/loss rate, which tells the proportion of observations the model classified incorrectly. The average proportion of incorrect predictions was 0.2801.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2801047
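A quick sanity check on the loss function with hand-picked toy values: since `class` is logical and coerces to 0/1, the condition `abs(class - prob) > 0.5` flags exactly the observations that are misclassified at a 0.5 threshold (ties at exactly 0.5 aside).

```r
# abs(class - prob) > 0.5 is the same as (prob > 0.5) != class
cls <- c(TRUE, FALSE, TRUE, FALSE)
prb <- c(0.9, 0.6, 0.2, 0.1)
mean(abs(cls - prb) > 0.5)                    # 0.5: two of four misclassified
identical((prb > 0.5) != cls, abs(cls - prb) > 0.5)  # TRUE
```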

The last step was to cross-validate my model against randomly allocated parts of the alc data, using 10-fold cross-validation. The average proportion of wrong predictions is about 0.29, slightly higher than the training error I got above.

library(boot)
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model3, K = 10)

# average number of wrong answers
cv$delta[1]
## [1] 0.2905759

Chapter 4: Clustering and stuff

We begin this week by analysing data from R’s MASS package. The data contains information about house prices in the Boston area.

library(MASS)  # the Boston data comes with the MASS package
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The data has 506 observations and 14 different variables. Next we plot the data to show the graphical overview and then we check how different variables compare to each other.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

This week we focus on the crim variable.

cor_matrix<-cor(Boston) 


cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
library(corrplot)

wb <- c("white","black") 

corrplot(cor_matrix, order="hclust", addrect=2, bg="gold2", col=wb)

This is a correlation plot, which shows the correlations between the variables when compared to each other. The bigger the spot, the stronger the correlation.

One of the strongest positive correlations is rad - tax. On the negative side the strongest ones are dis - indus, dis - nox and dis - age.
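As a reminder of what these numbers mean, the Pearson correlation is the covariance divided by the product of the standard deviations. A tiny sketch with made-up vectors:

```r
# cor(a, b) equals cov(a, b) / (sd(a) * sd(b))
a <- c(1, 2, 3, 4, 5)
b <- c(2, 4, 5, 4, 6)
all.equal(cor(a, b), cov(a, b) / (sd(a) * sd(b)))  # TRUE
```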

Next we standardize the data.

myboston_scaled <- scale(Boston)
summary(myboston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

After standardization, every variable now has a mean of 0 (and a standard deviation of 1).
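What scale() does, as a minimal sketch on a toy vector: it subtracts the column mean and divides by the column standard deviation.

```r
# scale(x) column-wise equals (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```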

myboston_scaled <- as.data.frame(myboston_scaled)
cutoffs <- quantile(myboston_scaled$crim)

labels <- c("low", "med_low", "med_high", "high")

crime_category <- cut(myboston_scaled$crim, breaks = cutoffs, include.lowest = TRUE, labels = labels)

table(crime_category)
## crime_category
##      low  med_low med_high     high 
##      127      126      126      127

Now we drop the old crime rate variable from the dataset. Then we divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

myboston_scaled <- dplyr::select(myboston_scaled, -crim)
myboston_scaled <- data.frame(myboston_scaled, crime_category)
str(myboston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn            : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus         : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas          : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox           : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm            : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age           : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis           : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad           : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax           : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio       : num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black         : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat         : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv          : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime_category: Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
n <- nrow(myboston_scaled)
# no seed is set here, so the split (and everything depending on it) changes between runs
eighty_perc <- sample(n, size = n * 0.8)
train <- myboston_scaled[eighty_perc,]
test <- myboston_scaled[-eighty_perc,]

Now we fit the linear discriminant analysis on the train set. We use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. This gives us an idea how crime rate interacts with the other variables.

lda.fit <- lda(crime_category ~ ., data = train)
lda.fit
## Call:
## lda(crime_category ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2648515 0.2475248 0.2425743 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       1.0063134 -0.9434110 -0.15302300 -0.9002885  0.53229839
## med_low  -0.1023478 -0.3401848  0.02203357 -0.5695235 -0.09133914
## med_high -0.3654691  0.1129711  0.23949396  0.3784589  0.11472196
## high     -0.4872402  1.0171960 -0.03128211  1.1035462 -0.42042768
##                 age        dis        rad        tax     ptratio
## low      -0.9053294  0.9053304 -0.6941744 -0.7267217 -0.50948550
## med_low  -0.3407832  0.3382362 -0.5460977 -0.5336951 -0.06795662
## med_high  0.4397371 -0.3580810 -0.4593187 -0.3632859 -0.34852313
## high      0.8031860 -0.8537761  1.6373367  1.5134896  0.77985517
##                black        lstat        medv
## low       0.38204776 -0.810525540  0.63277733
## med_low   0.32626321 -0.171141036  0.04134756
## med_high  0.07689883 -0.009629274  0.16224604
## high     -0.83107197  0.857043740 -0.71885429
## 
## Coefficients of linear discriminants:
##                 LD1         LD2           LD3
## zn       0.09075337  0.62970202 -0.8658488382
## indus    0.11040603 -0.14073406  0.4189764277
## chas    -0.08896632 -0.07352423  0.1307929062
## nox      0.29547567 -0.71672518 -1.3861016138
## rm      -0.12379427 -0.08548805 -0.1687279192
## age      0.14478886 -0.41348712 -0.2361745958
## dis     -0.07798786 -0.14848977 -0.0007502371
## rad      3.57709743  1.00578944  0.3143411690
## tax      0.09900658 -0.03434703  0.0626615756
## ptratio  0.09321830  0.01793973 -0.2429099947
## black   -0.09271723  0.04726208  0.1815234149
## lstat    0.20939863 -0.19743785  0.3769640509
## medv     0.16778408 -0.26372648 -0.2245855580
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9554 0.0337 0.0109

Now we draw the LDA biplot.

lda.arrows <- function(x, myscale = 0.5, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime_category)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

Now we predict the classes with the LDA model on the test data. We can help our analysis by cross tabulating the results with the crime categories from the test set.

correct_classes <- test$crime_category
test <- dplyr::select(test, -crime_category)

We see that in our model the first three groups (low, med_low and med_high) are very close to each other. The grouping is not that obvious and could use some more thought.

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      10        2    0
##   med_low    3      11        5    0
##   med_high   0       6       17    3
##   high       0       0        0   29

Let’s see how well the model actually worked. One observation I made during this analysis is that the results change every time I run the code, because the train/test split is random. The counts below come from an earlier run, so they may differ from the table above, but the overall picture holds.

The number of total observations

total <- c(13+10+6+11+7+12+18+25)
total
## [1] 102

We have 102 observations.

The number of correct predictions in this instance.

correct <- c(13+18+11+25)
correct
## [1] 67

We had 67 correctly predicted observations.

wrong <- c(102-67)
wrong
## [1] 35

And 35 of the observations were incorrectly predicted.

The accuracy of our model is the following:

ratio <- c(correct/total)
ratio
## [1] 0.6568627

This time the model did OK, but as I ran the code a few more times the accuracy varied somewhat (about 0.66 as I write this).
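A less error-prone way is to compute the accuracy straight from the cross-tabulation (diagonal over total) instead of typing the counts by hand. A sketch using the counts from the confusion table printed above (they change between runs):

```r
# accuracy = correct predictions (diagonal) / total observations;
# rows are the true classes, columns the predicted ones
cm <- rbind(c(16, 10,  2,  0),
            c( 3, 11,  5,  0),
            c( 0,  6, 17,  3),
            c( 0,  0,  0, 29))
sum(diag(cm)) / sum(cm)  # about 0.716 for that run
```

In general `mean(lda.pred$class == correct_classes)` gives the same number directly from the objects, with no manual copying.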

Now we reload the Boston dataset and standardize it. Then we take a look at the distances between the observations.

New_Boston <- Boston

str(New_Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
New_Boston_scaled <- scale(New_Boston) %>% as.data.frame()

str(New_Boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...

Now we calculate the distances.

# set the seed so the k-means runs below are reproducible
set.seed(123)
# Euclidean distance matrix
dist_eu <- dist(New_Boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000

The distances are based on the Euclidean measure, and above you see a summary of them. Now we can start the clustering with 10 clusters; 10 is a safe first guess, and further below I calculate the optimal number of clusters.
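For reference, dist() uses the Euclidean metric by default: the square root of the summed squared coordinate differences. A two-point sketch:

```r
# Euclidean distance between (0, 0) and (3, 4) is 5
m <- rbind(c(0, 0), c(3, 4))
all.equal(as.numeric(dist(m)), sqrt(sum((m[1, ] - m[2, ])^2)))  # TRUE; both give 5
```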

library(ggplot2)

km <- kmeans(dist_eu, centers = 10)  # note: k-means is usually run on the scaled data itself rather than a distance matrix
pairs(New_Boston_scaled, col = km$cluster)

With 10 clusters it is really hard to see what actually happens in this plot.

Because of this we want to optimize the number of clusters instead of guessing, and then we can take a look at the plot again.

k_max <- 10
wss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")

It seems that the optimal number of clusters is 2, as that is where the total within-cluster sum of squares drops most sharply. So we run the cluster analysis again and then take a look at the plot.

km <-kmeans(dist_eu, centers=2)
pairs(New_Boston_scaled, col = km$cluster)

The picture above is quite similar to the one a moment ago, but now the clusters are also included. The variables tax and rad seem to work best, in that the differences between clusters are clearest there compared to the plot without the clusters.

Next we perform LDA using the k-means clusters (this time three of them) as target classes. We include all the variables in the Boston data in the LDA model.

km <-kmeans(dist_eu, centers = 3)
lda.fits <- lda(km$cluster~., data = myboston_scaled)
lda.fits
## Call:
## lda(km$cluster ~ ., data = myboston_scaled)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.1383399 0.2569170 0.6047431 
## 
## Group means:
##           zn      indus        chas        nox         rm        age
## 1 -0.4321124  1.0399621  0.62757956  1.2667725 -0.5975497  0.8264008
## 2 -0.4872402  1.0743771 -0.18147291  0.7905978 -0.3261484  0.7799367
## 3  0.3058467 -0.6943345 -0.06646762 -0.6256594  0.2752542 -0.5203916
##          dis        rad        tax    ptratio        black      lstat
## 1 -0.9294285  1.1838094  1.2037534  0.2897641 -1.547587007  1.0514755
## 2 -0.7234163  0.7284612  0.8572944  0.5527964  0.004362524  0.6244404
## 3  0.5199481 -0.5802831 -0.6395785 -0.3011341  0.352169811 -0.5058188
##         medv crime_categorymed_low crime_categorymed_high
## 1 -0.6373069            0.01428571              0.2000000
## 2 -0.5333324            0.10000000              0.3230769
## 3  0.3723683            0.36601307              0.2287582
##   crime_categoryhigh
## 1          0.7714286
## 2          0.5615385
## 3          0.0000000
## 
## Coefficients of linear discriminants:
##                                LD1         LD2
## zn                     -0.39468364 -0.09677448
## indus                  -1.36978210 -1.03566594
## chas                   -0.29441343  0.53112017
## nox                    -0.54054935  0.56258301
## rm                      0.18662157 -0.41617536
## age                    -0.18175313 -0.30440972
## dis                    -0.08851529 -0.03497269
## rad                    -0.30229151 -0.04071641
## tax                    -0.35134096  0.17638898
## ptratio                -0.10630867 -0.07487010
## black                   0.47176537 -1.09887695
## lstat                  -0.40545791  0.29469437
## medv                   -0.31423331  0.48310594
## crime_categorymed_low   0.61873110 -0.06078345
## crime_categorymed_high  0.09629715 -0.73535354
## crime_categoryhigh     -0.08179388 -0.84651510
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9151 0.0849

Now we visualize the results with a biplot and then analyze them.

plot(lda.fits, dimen = 2, col = classes) 
lda.arrows(lda.fits, myscale = 1)
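
lda.arrows() is a helper from the course materials, not a base R function. Reproduced here from memory as a sketch, it overlays the linear discriminant coefficients as arrows on an existing LDA plot:

```r
library(MASS)  # for lda()

# Sketch of the course's helper: draw the LDA coefficient vectors as arrows
# on top of a plot(lda_fit) biplot.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- x$scaling  # matrix of linear discriminant coefficients
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       col = color, pos = 3, cex = tex)
}

# usage on any multi-class fit, e.g. the iris data:
fit <- lda(Species ~ ., data = iris)
plot(fit, dimen = 2)
lda.arrows(fit, myscale = 2)
```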

With three clusters in the k-means analysis, the most influential linear separators are nox and chas, which suggests they would be the most useful variables for classifying new observations if any appeared.

The last step is 3D modeling. It rounds off this week's project nicely: the resulting model is very clear and can be rotated a full 360 degrees, which helps us analyse the findings and understand the model.

model_predictors <- dplyr::select(train, -crime_category)


dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# project the training observations onto the discriminant axes
# (lda.fit is the LDA model fitted earlier on the train set)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
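
The matrix product projects each of the 404 training rows onto the three discriminant axes, producing the LD1–LD3 coordinates used in the 3D plot below. A quick dimension check using iris as hypothetical stand-in data:

```r
library(MASS)

fit  <- lda(Species ~ ., data = iris)  # 3 classes -> at most 2 discriminants
proj <- as.matrix(iris[, 1:4]) %*% fit$scaling

dim(proj)        # 150 observations x 2 discriminant coordinates
colnames(proj)   # "LD1" "LD2"
```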


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes, 
        colors=c("blue","yellow") )

Chapter 5: Dimensionality reduction techniques

Let's begin Chapter 5 by loading the data.

human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", header = T, sep=",")

dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Next we take a graphical overview of the data.

p <- ggpairs(human, mapping = aes(col="blue", alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

p

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
p2 <- ggplot(human, aes(x=GNI, y= Life.Exp)) + geom_point(col="deepskyblue1") + geom_smooth( col = "red2")

p2
## `geom_smooth()` using method = 'loess'

p3 <- ggplot(human, aes(x=GNI, y= Edu.Exp)) + geom_point(col="red2") + geom_smooth(col = "royalblue1")

p3
## `geom_smooth()` using method = 'loess'

Next we perform principal component analysis (PCA) on the unstandardized human data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.5, 1), col = c("black", "red2"), main="Biplot of the first two principal components for the unscaled data")

The biplot is hard to read because the data is not scaled: GNI has a far larger variance than the other variables, so it dominates the first principal component. This is why it is important to scale the data before PCA.
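
scale() centers each column to mean 0 and divides by the column standard deviation, so every variable enters the PCA on a comparable footing. A quick self-contained check (iris as stand-in data):

```r
std <- scale(iris[, 1:4])

round(colMeans(std), 10)  # all ~0 after centering
apply(std, 2, sd)         # all exactly 1 after scaling
```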

human_std <- scale(human)

pca_human11 <- prcomp(human_std)

summary(pca_human11)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
biplot(pca_human11, choices = 1:2, cex = c(0.6, 1.2), col = c("black", "red2"),main="Biplot of the first two principal components for the scaled data")
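
The proportions in the PCA summary come straight from the component standard deviations: each PC's share of the variance is its sdev squared over the total. A self-contained check (iris as stand-in data):

```r
pca <- prcomp(scale(iris[, 1:4]))

prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 4)          # "Proportion of Variance" row of summary(pca)
round(cumsum(prop_var), 4)  # "Cumulative Proportion" row
```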

Next we load the tea dataset from the FactoMineR package.

data("tea")
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

I decided to drop some of the variables, since there were 36 of them and plotting a graphical overview of such a variety of different variables would have been practically impossible.

keep <- c("friends", "Tea", "resto", "age_Q", "frequency")

teatimes <- dplyr::select(tea, one_of(keep))

str(teatimes)
## 'data.frame':    300 obs. of  5 variables:
##  $ friends  : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ Tea      : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ resto    : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ age_Q    : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency: Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
g1 <- ggplot(teatimes, aes(friends)) + geom_bar() + theme_grey() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
g2 <- ggplot(teatimes, aes(Tea)) + geom_bar()+ theme_grey() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
g3 <- ggplot(teatimes, aes(resto)) + geom_bar() + theme_grey() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
g4 <- ggplot(teatimes, aes(age_Q)) + geom_bar()+ theme_grey() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
g5 <- ggplot(teatimes, aes(frequency)) + geom_bar()+ theme_grey() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))

multiplot(g1, g2, g3, g4, g5, cols=3)

mca <- MCA(X = teatimes, graph=F)

summary(mca)
## 
## Call:
## MCA(X = teatimes, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.298   0.270   0.235   0.217   0.210   0.195
## % of var.             13.547  12.257  10.703   9.861   9.545   8.847
## Cumulative % of var.  13.547  25.804  36.508  46.368  55.913  64.760
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.187   0.175   0.161   0.131   0.122
## % of var.              8.511   7.949   7.313   5.933   5.533
## Cumulative % of var.  73.271  81.221  88.534  94.467 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           |  0.875  0.856  0.274 |  0.254  0.080  0.023 | -0.006  0.000
## 2           |  1.003  1.126  0.443 | -0.075  0.007  0.003 | -0.145  0.030
## 3           | -0.368  0.151  0.074 |  0.427  0.225  0.099 | -0.392  0.217
## 4           |  0.041  0.002  0.001 | -0.727  0.654  0.367 |  0.050  0.004
## 5           |  0.421  0.198  0.110 | -0.185  0.042  0.021 |  0.041  0.002
## 6           |  0.041  0.002  0.001 | -0.727  0.654  0.367 |  0.050  0.004
## 7           | -0.160  0.029  0.008 |  0.350  0.152  0.039 |  0.266  0.100
## 8           |  0.578  0.374  0.095 |  0.375  0.174  0.040 |  0.731  0.756
## 9           |  0.293  0.096  0.040 |  0.145  0.026  0.010 |  0.180  0.046
## 10          |  0.727  0.590  0.201 |  0.675  0.564  0.173 |  0.323  0.147
##               cos2  
## 1            0.000 |
## 2            0.009 |
## 3            0.084 |
## 4            0.002 |
## 5            0.001 |
## 6            0.002 |
## 7            0.022 |
## 8            0.152 |
## 9            0.015 |
## 10           0.040 |
## 
## Categories (the 10 first)
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## friends     |  -0.462   9.354   0.402 -10.965 |   0.139   0.942   0.037
## Not.friends |   0.870  17.629   0.402  10.965 |  -0.263   1.776   0.037
## black       |   0.686   7.800   0.154   6.792 |   1.123  23.080   0.413
## Earl Grey   |  -0.498  10.712   0.448 -11.568 |  -0.254   3.071   0.116
## green       |   1.374  13.936   0.233   8.353 |  -1.035   8.740   0.132
## Not.resto   |   0.216   2.308   0.131   6.249 |  -0.312   5.325   0.273
## resto       |  -0.604   6.456   0.131  -6.249 |   0.873  14.896   0.273
## 15-24       |  -0.800  13.182   0.283  -9.204 |  -0.412   3.866   0.075
## 25-34       |  -0.170   0.447   0.009  -1.608 |  -0.373   2.378   0.042
## 35-44       |   0.290   0.753   0.013   1.967 |   0.760   5.707   0.089
##              v.test     Dim.3     ctr    cos2  v.test  
## friends       3.310 |   0.059   0.191   0.006   1.394 |
## Not.friends  -3.310 |  -0.111   0.361   0.006  -1.394 |
## black        11.114 |   0.228   1.091   0.017   2.258 |
## Earl Grey    -5.891 |  -0.117   0.748   0.025  -2.717 |
## green        -6.292 |   0.172   0.278   0.004   1.048 |
## Not.resto    -9.029 |   0.321   6.453   0.289   9.288 |
## resto         9.029 |  -0.898  18.053   0.289  -9.288 |
## 15-24        -4.741 |   0.723  13.606   0.231   8.312 |
## 25-34        -3.528 |  -1.211  28.641   0.438 -11.443 |
## 35-44         5.152 |   0.241   0.659   0.009   1.636 |
## 
## Categorical variables (eta2)
##               Dim.1 Dim.2 Dim.3  
## friends     | 0.402 0.037 0.006 |
## Tea         | 0.484 0.470 0.025 |
## resto       | 0.131 0.273 0.289 |
## age_Q       | 0.403 0.298 0.523 |
## frequency   | 0.071 0.271 0.335 |
plot(mca, invisible=c("ind"), habillage = "quali")
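
A sanity check on the eigenvalue table: for MCA, the number of dimensions equals the total number of category levels minus the number of variables. Here 2 + 3 + 2 + 5 + 4 = 16 levels across 5 variables gives 11 dimensions, matching Dim.1 through Dim.11 in the summary above:

```r
# levels per selected variable (from the str(teatimes) output)
levels_per_var <- c(friends = 2, Tea = 3, resto = 2, age_Q = 5, frequency = 4)

n_dims <- sum(levels_per_var) - length(levels_per_var)
n_dims  # 11
```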